Representation Quality in Text Classification: An Introduction and Experiment
Abstract
The way in which text is represented has a strong impact on the performance of text classification (retrieval and categorization) systems. We discuss the operation of text classification systems, introduce a theoretical model of how text representation impacts their performance, and describe how the performance of text classification systems is evaluated. We then present the results of an experiment on improving text representation quality, as well as an analysis of the results and the directions they suggest for future research.

1 The Task of Text Classification

Text-based systems can be broadly classified into classification systems and comprehension systems. Text classification systems include traditional information retrieval (IR) systems, which retrieve texts in response to a user query, as well as categorization systems, which assign texts to one or more of a fixed set of categories. Text comprehension systems go beyond classification to transform text in some way, such as producing summaries, answering questions, or extracting data.

Text classification systems can be viewed as computing a function from documents to one or more class values. Most commercial text retrieval systems require users to enter such a function directly in the form of a boolean query. For example, the query

(language OR speech) AND AU = Smith

specifies a 1-ary 2-valued (boolean) function that takes on the value TRUE for documents that are authored by Smith and contain the word language or the word speech.

In statistical IR systems, which have long been investigated by researchers and are beginning to reach the marketplace, the user typically enters a natural language query, such as "Show me uses of speech recognition." The assumption is made that the attributes (content words, in this case) used in the query will be strongly associated with documents that should be retrieved.
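As a minimal sketch, the boolean query from this section, (language OR speech) AND AU = Smith, can be written as a two-valued function over documents. The `Document` class, its `author` and `text` fields, and the sample documents are hypothetical and serve only as illustration. Whitespace tokenization is an assumption, not something the text specifies.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical record: `author` stands in for the AU field of the query.
    author: str
    text: str


def matches_query(doc: Document) -> bool:
    """Evaluate (language OR speech) AND AU = Smith for one document."""
    # Assumed tokenization: lowercase and split on whitespace.
    words = set(doc.text.lower().split())
    has_topic_word = "language" in words or "speech" in words
    return has_topic_word and doc.author == "Smith"


# Made-up sample documents, not data from the paper.
docs = [
    Document("Smith", "Speech recognition applications"),
    Document("Jones", "Speech and speech based systems"),
    Document("Smith", "Trade show report"),
]
results = [matches_query(d) for d in docs]
```

Only the first sample document satisfies both the word condition and the author condition, so the function splits the collection into exactly two classes.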
A statistical IR system uses these attributes to construct a classification function, such as:

$$f(x) = c_1 x_{show} + c_2 x_{uses} + c_3 x_{speech} + c_4 x_{recognition}$$

This function assumes that there is an attribute corresponding to each word, and that attribute takes on some value for each document, such as the number of occurrences of the word in the document. The coefficients $c_i$ indicate the weight given to each attribute. The function produces a numeric score for each document, and these scores can be used to determine which documents to retrieve or, more usefully, to display documents to the user in ranked order:

Speech Recognition Applications      0.88
Jones Gives Speech at Trade Show     0.65
Speech and Speech Based Systems      0.57

Most methods for deriving classification functions from natural language queries use statistics of word occurrences to set the coefficients of a linear discriminant function [5,20]. The best results are obtained when supervised machine learning, in the guise of relevance feedback, is used [21,6].

Text categorization systems can also be viewed as computing a function defined over documents, in this case a k-ary function, where k is the number of categories into which documents can be sorted. Rather than deriving this function from a natural language query, it is typically constructed directly by experts [28], perhaps using a complex pattern matching language [12]. Alternately, the function may be induced by machine learning techniques from large numbers of previously categorized documents [17,11,2].

1.1 Text Representation and The Concept Learning Model

Any text classification function assumes a particular representation of documents. With the exception of a few experimental knowledge-based IR systems [15], these text representations map documents into vectors of attribute values, usually boolean or numeric.
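The following is a rough sketch of how a word-count representation and a linear scoring function of the kind described in Section 1 fit together. The integer weights in `WEIGHTS` are invented for illustration and are not the coefficients behind the scores quoted in the text. The helper names `to_vector` and `score` are likewise hypothetical.

```python
import re
from collections import Counter

# Content words of the query "Show me uses of speech recognition".
QUERY_TERMS = ["show", "uses", "speech", "recognition"]

# Hypothetical coefficients c_i, chosen only for illustration.
WEIGHTS = {"show": 3, "uses": 1, "speech": 2, "recognition": 4}


def to_vector(text):
    """Map a text to a vector of occurrence counts, one entry per query term."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in QUERY_TERMS]


def score(text):
    """Linear function f(x) = sum_i c_i * x_i over the attribute vector."""
    return sum(WEIGHTS[t] * x for t, x in zip(QUERY_TERMS, to_vector(text)))


titles = [
    "Speech and Speech Based Systems",
    "Speech Recognition Applications",
    "Jones Gives Speech at Trade Show",
]

# Display documents in ranked order, highest score first.
ranked = sorted(titles, key=score, reverse=True)
for title in ranked:
    print(f"{title:35s} {score(title)}")
```

With these assumed weights the ordering of the three titles happens to agree with the ranked list shown in Section 1, although the score values themselves differ.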
For example, the document title "Speech and Speech Based Systems" might be represented as